rabbit_msg_store: terminate GC with exit signal during shutdown #15498

lhoguin merged 2 commits into rabbitmq:main

Conversation
Leaving as draft until I can re-reproduce the issue without this fix and then verify the fix in a "real" environment. Early reviews are welcome, of course 😸
@lhoguin can you please take a quick look? Thank you.
No hurry, because I am still working on this PR and testing it.
Perhaps just change […]
force-pushed from e2d9e27 to 37e634c
Hi Loïc, the genie and I looked into this and I want to make sure I understand your suggestion correctly.
```c
if ((op != ERTS_SIG_Q_OP_EXIT || reason != am_kill)
    && (c_p->flags & F_TRAP_EXIT)) {
    convert_prepared_sig_to_msg(c_p, sig, xsigd->message, next_nm_sig);
```

So […] Only […] Am I missing something in your suggestion?

Addendum: The analysis above is incorrect. It was produced by Claude (an AI assistant) and I posted it without verifying the load-bearing claim.
So @lhoguin is right: […]
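For readers following the ERTS digression: the condition quoted above encodes that a trapping process converts every exit signal except `kill` into a message. A minimal Erlang sketch of that behaviour (my own illustration, not code from the PR):

```erlang
-module(trap_exit_demo).
-export([run/0]).

%% A process that traps exits and keeps logging whatever it receives.
trapper() ->
    process_flag(trap_exit, true),
    receive Msg -> io:format("trapped: ~p~n", [Msg]), trapper() end.

run() ->
    Pid = spawn(fun trapper/0),
    exit(Pid, shutdown),           %% trapped: delivered as an {'EXIT', Sender, shutdown} message
    timer:sleep(100),
    true = is_process_alive(Pid),  %% the process survived the shutdown signal
    exit(Pid, kill),               %% 'kill' cannot be trapped (the am_kill branch above)
    timer:sleep(100),
    false = is_process_alive(Pid),
    ok.
```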
force-pushed from ec554c8 to b2fad2f
I don't think it sets […]
force-pushed from ff3f430 to 9f4ffaf
Thanks again @lhoguin for taking a look.
Changed the title: "rabbit_msg_store: use bounded timeout for GC stop during shutdown" → "rabbit_msg_store: terminate GC with exit signal during shutdown"
force-pushed from 380cea7 to 28dd04d
When `rabbit_msg_store` shuts down, its `terminate/2` callback calls `rabbit_msg_store_gc:stop/1`, which sends a `gen_server2:call(stop, infinity)`. If the GC process is blocked mid-`handle_cast` on disk I/O (for example, during compaction under disk pressure), the call blocks indefinitely. `terminate/2` never reaches the code that writes the recovery files (`file_summary.ets`, `msg_store_index.ets`, `clean.dot`). Eventually the msg_store child's supervisor shutdown timeout expires (`msg_store_shutdown_timeout`, default 600s) and the supervisor kills the msg_store process with reason `killed`. With no recovery files on disk, the next startup rebuilds indices from scratch by scanning every segment file, which is slow and expensive for large stores.

This was observed in production on a broker under PerfTest load. The persistent message store for the `/` vhost logged "Stopping message store" and then nothing for 10 minutes until the supervisor killed it. The GC process was blocked on disk I/O while disk free space was hovering near the 2 GiB limit. On restart the store logged "rebuilding indices from scratch" despite the shutdown having been initiated gracefully via `rabbitmqctl stop`.

Replace the synchronous `rabbit_msg_store_gc:stop/1` call in `terminate/2` with a new `stop_gc/1` helper that monitors the GC, sends `exit(GCPid, shutdown)`, and waits for the `'DOWN'` message. `rabbit_msg_store_gc` does not trap exits, so an exit signal terminates it immediately even if it is running inside a `handle_cast` callback on disk I/O. Exit signals are processed by the scheduler, not the user-level receive loop, so the signal preempts the blocked callback. The GC's `terminate/2` is a no-op, so bypassing it has no side effect.

The wait for `'DOWN'` is bounded by `max(msg_store_shutdown_timeout - 60_000, 5_000)` so that `terminate/2` stays within the msg_store child's own supervisor shutdown timeout and leaves at least 60s for the remaining steps (syncing the current file, writing the file summary, tearing down ETS, writing recovery terms). If the shutdown signal does not produce a `'DOWN'` in time, fall back to `exit(GCPid, kill)`. `kill` cannot be trapped, so the inner `receive` has no timeout.

Killing the GC mid-operation is safe with respect to message data:

- `compact_file` copies messages before updating the index, and the original data remains on disk until truncation. The code comments confirm: "it's OK if we crash at any point before we update the index because the old data is still there until we truncate."
- `truncate_file` only removes data that has already been compacted to earlier offsets.
- `delete_file` only deletes files with zero valid messages, enforced by assertions before the delete.

The unclean recovery path (`build_index/3`) rebuilds everything from the actual segment files on disk using `scan_file_for_valid_messages`, so any inconsistency between the file summary and the on-disk state is handled. In the common case (GC killed before it modified the file summary ETS), the recovery files are fully consistent and the next startup recovers cleanly without a rebuild.

Add `rabbit_msg_store:gc_pid/1` to expose the GC pid for testing. Add two test cases to `backing_queue_SUITE`:

- `msg_store_gc_stuck_suspended` suspends the GC with `sys:suspend`, terminates the persistent store via the supervisor, and verifies that the store recovers cleanly (`successfully_recovered_state` returns `true`) with all messages intact. This covers the case where the GC is blocked in its receive loop (see the sketch below).
- `msg_store_gc_stuck_mid_callback` mocks `compact_file/2` to block indefinitely inside the `handle_cast` callback, sends a compact cast to the GC to put it in the blocking callback, then terminates the store via the supervisor. This covers the scenario that motivated the PR: a GC stuck on disk I/O inside a callback.

Both tests confirm that `exit(GcPid, shutdown)` terminates the GC process regardless of what Erlang code it is running, that `terminate/2` completes and writes the recovery files, and that the next startup recovers cleanly.
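To make the first test's mechanics concrete, here is a rough sketch of its shape. `StoreSup`, `StoreChildId` and `Store` are hypothetical handles standing in for the real supervisor and msg_store references; the actual case in `backing_queue_SUITE` runs inside a broker via the CT helpers:

```erlang
%% Rough sketch of the msg_store_gc_stuck_suspended flow, not the
%% suite's actual code.
gc_stuck_suspended_sketch(StoreSup, StoreChildId, Store) ->
    GCPid = rabbit_msg_store:gc_pid(Store),
    ok = sys:suspend(GCPid),  %% GC no longer services its receive loop
    %% Without the fix, terminate/2 would now hang in
    %% gen_server2:call(stop, infinity) until the supervisor kills it.
    ok = supervisor:terminate_child(StoreSup, StoreChildId),
    {ok, NewStore} = supervisor:restart_child(StoreSup, StoreChildId),
    %% A clean recovery (no index rebuild) is the pass condition.
    true = rabbit_msg_store:successfully_recovered_state(NewStore).
```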
The `exit(GCPid, shutdown)` signal is expected to terminate the GC immediately because the GC does not trap exits. If it does not, the fallback `exit(GCPid, kill)` path runs silently and the operator has no visibility into a case that should, by design, never happen. Emit a warning on the fallback path identifying the store directory and the timeout that expired. This makes the fallback searchable in fleet-wide logs so we can tell whether the bounded-wait was ever actually needed in production. Thread the `Dir` argument back into `stop_gc/2` for the log message.
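Putting the two commit messages together, the helper's shape is roughly the following. This is a sketch of the described approach, not the merged diff; the timeout lookup and the exact log call are my approximations:

```erlang
%% Sketch of stop_gc/2 as described above: monitor, send a plain
%% 'shutdown' exit signal, wait for 'DOWN' within a bound that leaves
%% terminate/2 headroom, and escalate to 'kill' with a warning if the
%% bound expires.
stop_gc(GCPid, Dir) ->
    ShutdownTimeout =
        application:get_env(rabbit, msg_store_shutdown_timeout, 600_000),
    Timeout = max(ShutdownTimeout - 60_000, 5_000),
    MRef = erlang:monitor(process, GCPid),
    exit(GCPid, shutdown),
    receive
        {'DOWN', MRef, process, GCPid, _Reason} ->
            ok
    after Timeout ->
        logger:warning(
          "msg_store ~ts: GC did not exit within ~bms after shutdown "
          "signal; killing it", [Dir, Timeout]),
        exit(GCPid, kill),
        %% 'kill' cannot be trapped, so 'DOWN' is guaranteed to arrive:
        %% this inner receive needs no timeout.
        receive {'DOWN', MRef, process, GCPid, _} -> ok end
    end.
```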
force-pushed from 28dd04d to 91ec75f
Thank you!!
`rabbit_msg_store`: terminate GC with exit signal during shutdown (backport #15498)
Problem

When `rabbit_msg_store` shuts down, its `terminate/2` callback calls `rabbit_msg_store_gc:stop/1`, which sends a `gen_server2:call(stop, infinity)` to the GC process. If the GC is blocked mid-`handle_cast` on disk I/O (for example, during compaction under disk pressure), the call blocks indefinitely. `terminate/2` never reaches the code that writes the recovery files (`file_summary.ets`, `msg_store_index.ets`, `clean.dot`). Eventually the msg_store child's supervisor shutdown timeout expires (`msg_store_shutdown_timeout`, default 600s) and the supervisor kills the msg_store process. With no recovery files on disk, the next startup rebuilds indices from scratch by scanning every segment file, which is slow and expensive for large stores.

We hit this in production on a broker under PerfTest load. The persistent message store for the `/` vhost logged "Stopping message store" and then nothing for 10 minutes until the supervisor killed it with reason `killed`. The GC process was blocked on disk I/O while disk free space was hovering near the 2 GiB limit. On restart, the store logged "rebuilding indices from scratch" despite the shutdown having been initiated gracefully via `rabbitmqctl stop`.

Fix
Replace the synchronous `rabbit_msg_store_gc:stop/1` call in `terminate/2` with a new `stop_gc/2` helper that monitors the GC, sends `exit(GcPid, shutdown)`, and waits for the `'DOWN'` message. `rabbit_msg_store_gc` does not trap exits, so an exit signal terminates it immediately even if it is running inside a `handle_cast` callback on disk I/O. Exit signals are processed by the scheduler, not the user-level receive loop, so the signal preempts the blocked callback. The GC's `terminate/2` is a no-op, so bypassing it has no side effect.

The wait for `'DOWN'` is bounded by `max(msg_store_shutdown_timeout - 60_000, 5_000)` so that `terminate/2` stays within the msg_store child's own supervisor shutdown timeout and leaves at least 60s for the remaining steps (syncing the current file, writing the file summary, tearing down ETS, writing recovery terms). If the shutdown signal does not produce a `'DOWN'` in time, fall back to `exit(GcPid, kill)`. `kill` cannot be trapped, so the inner `receive` has no timeout. A warning is logged on the fallback path so the case can be tracked in production logs.

Killing the GC mid-operation is safe with respect to message data:

- `compact_file` copies messages before updating the index, and the original data remains on disk until truncation. The code comments confirm: "it's OK if we crash at any point before we update the index because the old data is still there until we truncate."
- `truncate_file` only removes data that has already been compacted to earlier offsets.
- `delete_file` only deletes files with zero valid messages, enforced by assertions before the delete.

The unclean recovery path (`build_index/3`) rebuilds everything from the actual segment files on disk using `scan_file_for_valid_messages`, so any inconsistency between the file summary and the on-disk state is handled. In the common case (GC killed before it modified the file summary ETS), the recovery files are fully consistent and the next startup recovers cleanly without a rebuild.

Tests
Two test cases in `backing_queue_SUITE` exercise the fix:

- `msg_store_gc_stuck_suspended` suspends the GC with `sys:suspend`, terminates the persistent store via the supervisor, and verifies that the store recovers cleanly (`successfully_recovered_state` returns `true`) with all messages intact. This covers the case where the GC is blocked in its receive loop.
- `msg_store_gc_stuck_mid_callback` mocks `compact_file/2` to block indefinitely inside the `handle_cast` callback, sends a compact cast to the GC to put it in the blocking callback, then terminates the store via the supervisor. This covers the scenario that motivated the PR: a GC stuck on disk I/O inside a callback.

Both tests confirm that `exit(GcPid, shutdown)` terminates the GC process regardless of what Erlang code it is running, that `terminate/2` completes and writes the recovery files, and that the next startup recovers cleanly.